PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Abstract

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines. More comprehensive video results and comparisons are shown on the project page in the supplementary material.
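The abstract describes the 3D-RoPE-based enhancement module only at a high level. Below is a minimal, illustrative sketch of 3D rotary position embedding in PyTorch, assuming the common construction in which the attention head dimension is split across temporal, height, and width axes and each slice is rotated by standard 1D RoPE. The function names, the channel split `dims=(16, 24, 24)`, and the toy coordinates are our own assumptions for illustration, not PolyVivid's released implementation.

```python
# Minimal 3D-RoPE sketch (PyTorch). Assumption: the head dimension is
# partitioned across (temporal, height, width) axes and each partition is
# rotated by standard 1D rotary embeddings. Names and the split are
# illustrative, not PolyVivid's actual code.
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0):
    """Standard 1D rotary angles for integer positions `pos` of shape [N]."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # [dim/2]
    angles = pos.float()[:, None] * freqs[None, :]                   # [N, dim/2]
    return angles.cos(), angles.sin()

def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate interleaved pairs: (x0, x1) -> (x0*cos - x1*sin, x0*sin + x1*cos)."""
    x0, x1 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x0 * cos - x1 * sin,
                        x0 * sin + x1 * cos), dim=-1).flatten(-2)

def rope_3d(q: torch.Tensor, t: torch.Tensor, h: torch.Tensor, w: torch.Tensor,
            dims=(16, 24, 24)):
    """Apply 3D RoPE to q of shape [N, head_dim], head_dim == sum(dims),
    given per-token (t, h, w) coordinates, each of shape [N]."""
    out, start = [], 0
    for pos, d in zip((t, h, w), dims):
        cos, sin = rope_1d(pos, d)
        out.append(apply_rotary(q[:, start:start + d], cos, sin))
        start += d
    return torch.cat(out, dim=-1)

# Toy usage: 2 frames x 2x2 latent grid = 8 video tokens with head_dim 64.
T, H, W = 2, 2, 2
t, h, w = torch.meshgrid(torch.arange(T), torch.arange(H),
                         torch.arange(W), indexing="ij")
q = torch.randn(T * H * W, 64)
q_rot = rope_3d(q, t.flatten(), h.flatten(), w.flatten())
```

Under this reading, reference-image and text tokens can be assigned their own (t, h, w) coordinates so that all modalities share one positional space, which is one plausible way to realize the "structured bidirectional fusion" between text and image embeddings that the abstract mentions.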

Comparison with State-of-the-Art Methods

We compare our model on two-subject video customization against state-of-the-art methods, including Keling, Vidu, Pika, Skyreels A2, and VACE.

Prompt: A woman is dressed in elegant attire, dancing gracefully beneath a tall building.

Prompt: A woman wearing a pink blazer is showcasing a Chanel lip gloss.

Prompt: A young girl is holding a roasted duck.

Prompt: A woman wearing a white tank top is holding an iPhone.

Prompt: A man and a woman walk hand in hand on the road.

Prompt: A tiger is fighting with a giraffe.

Prompt: A giraffe is fighting with a giraffe.

Comparison with State-of-the-Art Methods (Three Subjects)

We compare our model on three-subject video customization against state-of-the-art methods, including Keling, Vidu, Pika, Skyreels A2, and VACE.

Prompt: A man is drinking coffee on the sofa.

Prompt: A person riding on a tiger, holding an umbrella.

More Multi-subject Customization Results

We show more multi-subject customization results. Our model generates natural and realistic interactions among various types of inputs, demonstrating its potential in applications such as advertising and movie production. Furthermore, beyond subject interactions, our model can also generate specified subjects within assigned scenes, which is particularly useful for personalized content creation and other creative industries.

More Multi-subject Customization Results (Three Subjects)

We show more three-subject customization results, featuring diverse combinations such as human-animal-animal, human-object-animal, human-animal-scene, and human-object-object. These results illustrate that our model can effectively handle different combinations of inputs and generate complex interactions among multiple subjects, all while maintaining strong identity preservation. This demonstrates the superior capability of our model in customized video generation for multi-subject scenarios.